Automated Phrase Mining from Massive Text Corpora

Authors

  • Jingbo Shang
  • Jialu Liu
  • Meng Jiang
  • Xiang Ren
  • Clare R. Voss
  • Jiawei Han
Abstract

As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus. Phrase mining is important in various tasks including automatic term recognition, document indexing, keyphrase extraction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers, and are thus likely to perform unsatisfactorily on text corpora from new domains and genres without extra, expensive adaptation. Recently, a few data-driven methods have been developed successfully for extracting phrases from massive domain-specific text. However, none of the state-of-the-art models is fully automated, because they require human experts to design rules or label phrases. In this paper, we propose a novel framework for automated phrase mining, AutoPhrase, which can achieve high performance with minimal human effort. Two new techniques have been developed: (1) by leveraging knowledge bases, a robust positive-only distant training method can avoid extra human labeling effort; and (2) when a part-of-speech (POS) tagger is available, a POS-guided phrasal segmentation model can better capture the syntactic information of the particular language and further enhance performance by considering the context. Note that AutoPhrase can support any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, and it benefits from, but does not require, a POS tagger. Compared to the state-of-the-art methods, the new method has shown significant improvements in effectiveness on five real-world datasets in different domains and languages.
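
The idea of positive-only distant training in technique (1) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace tokenizer, the frequency threshold, and the choice to treat every frequent candidate missing from the knowledge base as an unlabeled (possibly noisy negative) example are all assumptions made for illustration.

```python
from collections import Counter


def candidate_ngrams(tokens, max_len=3):
    """Yield every contiguous n-gram (as a tuple) of up to max_len tokens."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def build_label_pools(documents, kb_phrases, min_count=2, max_len=3):
    """Split frequent multi-word candidates into two pools without human labels.

    Hypothetical helper: candidates found in the knowledge base are taken as
    positives; all other frequent candidates stay unlabeled and may contain
    good phrases that the knowledge base simply lacks.
    """
    counts = Counter()
    for doc in documents:
        # Naive lowercase whitespace tokenization (an assumption of this sketch).
        counts.update(candidate_ngrams(doc.lower().split(), max_len))

    # Keep only multi-word candidates that occur at least min_count times.
    frequent = {g for g, c in counts.items() if c >= min_count and len(g) > 1}
    kb = {tuple(p.lower().split()) for p in kb_phrases}

    positives = frequent & kb
    unlabeled = frequent - kb
    return positives, unlabeled


docs = [
    "the support vector machine is popular",
    "we train the support vector machine today",
]
positives, unlabeled = build_label_pools(docs, ["Support Vector Machine"])
print(sorted(positives))
print(sorted(unlabeled))
```

The two pools could then serve as training data for a phrase-quality classifier; because the unlabeled pool is noisy, a robust learner would be needed for that step, which this sketch does not cover.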

Similar resources

Constructing Structured Information Networks from Massive Text Corpora

In today’s computerized and information-based society, text data is rich but messy. People are soaked with vast amounts of natural-language text data, ranging from news articles, social media posts, and advertisements to a wide range of textual information from various domains (medical records, corporate reports). To turn such massive unstructured text data into actionable knowledge, one of the gra...

TEES 2.1: Automated Annotation Scheme Learning in the BioNLP 2013 Shared Task

We participate in the BioNLP 2013 Shared Task with Turku Event Extraction System (TEES) version 2.1. TEES is a support vector machine (SVM) based text mining system for the extraction of events and relations from natural language texts. In version 2.1 we introduce an automated annotation scheme learning system, which derives task-specific event rules and constraints from the training data, and ...

Mining Key Phrase Translations from Web Corpora

Key phrases are usually among the most information-bearing linguistic structures. Translating them correctly will improve many natural language processing applications. We propose a new framework to mine key phrase translations from web corpora. We submit a source phrase to a search engine as a query, then expand queries by adding the translations of topic-relevant hint words from the returned ...

Robust Parsing, Error Mining, Automated Lexical Acquisition, And Evaluation

In our attempts to construct a wide coverage HPSG parser for Dutch, techniques to improve the overall robustness of the parser are required at various steps in the parsing process. Straightforward but important aspects include the treatment of unknown words, and the treatment of input for which no full parse is available. Another important means to improve the parser's performance on unexpected...

Scalable Phrase Mining for Ad-hoc Text Analytics

Large text corpora with news, customer mail and reports, or Web 2.0 contributions offer a great potential for enhancing business-intelligence applications. We propose a framework for performing text analytics on such data in a versatile, efficient, and scalable manner. While much of the prior literature has emphasized mining keywords or tags in blogs or social-tagging communities, we emphasize ...

Journal:
  • CoRR

Volume: abs/1702.04457

Publication date: 2017